Abstract: Unsafe websites consist of malicious as well as
inappropriate sites, such as those hosting questionable or offensive
content. Website reputation systems are intended to help ordinary
users steer away from these unsafe sites. However, the process of
assigning safety ratings for websites typically involves humans.
Consequently, it is time consuming, costly and not scalable. This
has resulted in two major problems: (i) a significant proportion of
the web space remains unrated and (ii) there is an unacceptable
time lag before new websites are rated.

In this paper, we show that by leveraging structural and 
content-based properties of websites, it is possible to reliably and 
efficiently predict their safety ratings, thereby mitigating both 
problems. We demonstrate the effectiveness of our approach using 
four datasets of up to 90,000 websites. We use ratings from Web
of Trust (WOT), a popular crowdsourced web reputation system,
as ground truth. We propose a novel ensemble classification
technique that makes opportunistic use of available structural and
content properties of webpages to predict their eventual ratings in
two dimensions used by WOT: trustworthiness and child safety.
Ours is the first classification system to predict such subjective
ratings and the same approach works equally well in identifying
malicious websites. Across all datasets, our classification performs
well with average F1-score in the 74-90% range.

I. Introduction 

The Internet has revolutionized the way we communicate 
today and has already become an integral part of our daily 
lives. The immense popularity of the Internet, with an increasing
user base of billions, has naturally attracted miscreants.
They set up various types of “unsafe” websites to lure their 
victims. These include malicious sites, intended for phishing, 
drive-by-downloads of malware etc. as well as sites that 
are inappropriate in some sense. Examples include websites 
hosting offensive, objectionable, hateful or illegal content, and 
misusing private user data. 

A variety of mechanisms have been developed for steering
unsuspecting users away from unsafe websites. Popular
browsers present interstitial security warnings when users
attempt to navigate to a known malicious website [1]. Several
anti-virus vendors maintain website reputation systems (e.g.,
TrustedSource¹). These systems use a combination of machine
learning techniques and manual expert evaluation to arrive
at the rating for a given website. A popular sub-category of
reputation systems are those that make use of input ratings that
are crowdsourced from the users of the system. PhishTank² and
Web of Trust (WOT)³ are examples of web reputation systems


¹http://www.trustedsource.org/

²http://www.phishtank.com/

³https://www.mywot.com/



Fig. 1. Cumulative availability (%) of WOT reputation ratings along two 
dimensions (trustworthiness and child safety) for the one million most popular 
webpages as of July 2014 collected from Alexa (http://www.alexa.com). The 
plot highlights that often (over 58%) webpages lack reputation ratings in at 
least one dimension to indicate their safety. 

that rely fully or partly on crowdsourced ratings. An advantage 
of crowdsourced ratings is that the ratings can cover a broader 
class of unsafe websites, including those that are perceived 
to be inappropriate but not outright malicious [10]. Typically 
these rating systems are queried by dedicated browser extensions,
which can signal the result to the user in the form of a
color-coded glyph, e.g., a red glyph indicating an unsafe site 
and a green glyph indicating a safe site (see Section III). 

All website reputation rating systems, especially those that 
involve humans in the rating process, suffer from two major 
disadvantages: 

• Insufficient coverage: Often the webspace coverage 
by a reputation system is limited. This is due to the 
high cost and poor scalability of obtaining experts’ 
ratings, as well as the lack of motivation for users to 
participate actively in rating webpages. This problem 
is illustrated in Fig. 1, which shows the availability of
WOT reputation ratings for the one million most popular
webpages. In WOT, websites are rated according
to two dimensions, trustworthiness and child safety, 
which are both subjective. A majority of the webpages 
are unrated: 58% for trustworthiness and 66% for child 
safety. 

• Time lag: The time gap between a new website coming 
online and the system assigning a rating to it can 
be long. Often the gap can be in the order of days 
to months. This is particularly problematic because 
unsafe websites tend to be short-lived, with lifetimes 
often in the order of hours or days [27]. 

A consequence of these drawbacks is that users who rely on
such reputation systems to protect them from unsafe content
remain vulnerable when many unsafe websites are unrated.
Regardless of the concerns as to whether such reputation 
systems based on crowdsourced ratings are effective [26], the 
sheer number of users who rely on such systems (see Section 
III) warrants solutions to mitigate this vulnerability. 

Rating systems that make use of machine learning techniques,
provided that they are fast, can address these shortcomings.
Machine learning has already been extensively used
for detecting malicious websites based on the structure and
content of web pages [9], [11], [30]. It is plausible that similar
techniques can also be applied to predicting the future rating
of a website in subjective dimensions as used in systems
like WOT. Thus, our work addresses the following research
question: Can we reliably predict the eventual rating of an
unrated website?

In this paper, we describe LookAhead, the system we built 
in the process of investigating this question. LookAhead uses a 
combination of structural and content-based features to predict 
the eventual rating a website is likely to receive. In reality, 
not all feature types can be extracted from all webpages (see 
Section IV) and to mitigate this feature unavailability problem 
we propose an ensemble classification approach. Accordingly, 
we train different classifiers for each feature type and present 
different combination strategies to estimate the overall rating. 
For the structure of the websites, we consider HTML and 
JavaScript-based features. However, we show that structural 
features alone would not be sufficient for accurate predictions. 
Therefore, we introduce a novel content-based feature set, that 
is extracted from the malicious outward links and the text 
present on a webpage. 

We make the following contributions: 

• Content-based features for effectively predicting the 
future rating of a website. In particular, we propose 
a novel use of the empirical cumulative distribution 
function (ECDF) as a feature set to extract clues 
about the content of a web page based on ratings of 
hyperlinks embedded in it (Section IV-B1). We also
propose how topic modeling techniques can be used to 
extract features that capture the theme of a webpage 
(Section IV-B2). 

• LookAhead: an adaptive ensemble classification 
technique effectively combining several classifiers for 
structural and content-based feature sets by learning 
combination weights from data (Section IV-C). 

• Systematic evaluation of LookAhead on several
datasets with up to 90,000 web pages (Section V). We
also evaluate the performance of Prophiler [9], which
uses only structural features, on those datasets.
We show that the performance of LookAhead is statistically
significantly better than using only the structural features of
web pages, as in Prophiler (Section VI). In particular,
this holds across both subjective dimensions (trustworthiness
and child safety), as well as maliciousness.

II. Related Work

A typical approach for helping users avoid malicious websites
is to make use of blacklists of known bad websites. For
example, Microsoft’s Internet Explorer and Mozilla Firefox
web browsers warn users when they try to visit a page present
on a blacklist. Unfortunately, blacklists suffer from a number
of shortcomings, e.g., blacklists need to be updated
periodically, are often slow to reflect new malicious websites,
and have poor coverage of malicious webspace. Moreover,
adversaries often try to masquerade malicious webpages as benign
by making slight modifications to their URLs. To mitigate
problems with blacklists, Felegyhazi et al. propose a system
that, given an initial blacklist of domains, tries to predict
potentially malicious domains based on nameserver features
and registration information [17]. Prakash et al. propose five
different heuristics that allow synthesizing new URLs from
existing ones. The authors use this idea to enlarge the existing
blacklist of malicious URLs [29].

Going beyond blacklists, application of machine learning
techniques to successfully identify malicious websites has become
popular. Ma et al. [23] explore the use of lexical features,
including the length and number of dots in URLs, and host-based
features, such as IP address, domain name and other data
returned by WHOIS queries [14], to identify malicious web
links. They evaluated their approach using 20,000 URLs drawn
from different sources: benign URLs are collected
from the DMOZ Open Directory Project⁴ and Yahoo’s directory⁵,
and malicious URLs from PhishTank and Spamscatter [2].
They reported a false positive rate (FPR) of 14.8% and a false
negative rate (FNR) of 8.9%.

Another popular approach is to analyze the structural properties
of webpages, especially looking for known malicious
patterns within the embedded JavaScript, to identify malicious
sites that trigger drive-by-download attacks [11], [13], [16],
[21], [30]. JSAND by Cova et al. [11] combines anomaly
detection with emulation and uses a naive Bayes classifier.
Out of around 800 malicious pages and scripts, they report
a very low FNR of 0.2%, although they do not report the
corresponding FPR. Cujo [30] by Rieck et al. considers both
static and dynamic JavaScript features and classifies websites
using Support Vector Machines (SVM). They look at around
220,000 benign websites and 600 drive-by-download attacks
and report an average true positive rate (TPR) of 94.4% with
a 0.002% FPR. They also ran JSAND on the same dataset and
report a 99.8% TPR. Finally, ZOZZLE by Curtsinger et al. [13]
considers over 1.2 million JavaScript samples and achieves FPR
and TPR in the range of 1.2-5.1%.

Closest to our work is Prophiler [9] by Canali et al., which
also looks at identifying malicious websites, but using only
static features related to the URL and the structure of a page.
For each web page they consider 37 URL-based, 20 HTML
and 26 JavaScript features and train three different classifiers,
one for each feature type. They reported 0.77% false negatives
and 9.88% false positives⁶ with a dataset of 15,000 web pages.

Below we summarize the main differences between our 
work and previous research. While systems like Prophiler [9] 
and JSAND [11] report good results for detecting malicious 
websites, we consider a much broader and non-trivial problem 
of predicting subjective rating dimensions like trustworthiness 
and child-safety of a website. In addition, we consider not 

⁴http://www.dmoz.org/ [Retrieved: April 21, 2015]

⁵http://random.yahoo.com/bin/ryl/ [Retrieved: April 21, 2015]

⁶Not to be confused with FNR or FPR; see Appendix D for exact definitions.
Fig. 2. (a) WOT user interface showing aggregated user ratings of a web
page being poor (red) in both dimensions (trustworthiness and child safety).
The confidence in the rating is represented by the number of dark figurines.
(b) WOT divides the reputation rating range into five levels. The ratings of a
webpage are indicated by a color-coded glyph representing the level.


only the structure and URL of a web page but also the content 
presented on that page. To our knowledge, this is the first study 
combining static and content-based website features to predict 
the future reputation rating of a website. 



Fig. 3. An architectural overview of LookAhead in association with a 
crowdsourced web reputation system WOT. 


$$\mathrm{class}(r) = \begin{cases} \text{bad} & \text{if } r < T_h \\ \text{good} & \text{otherwise} \end{cases} \qquad (1)$$


III. Web Reputation System WOT 

WOT is a web reputation system that provides reputation
ratings of the domain of a given URL in two dimensions, 
trustworthiness and child safety, as an integer in the range 
[0-100]. WOT builds the reputation ratings of a web domain 
mainly based on crowdsourced input ratings from a large user 
base and then applying a proprietary aggregation algorithm. 
Additionally, WOT uses input from other trusted sources, but 
the identities of these sources are not public. 

The user front-end of WOT is a browser extension that
scans the page being rendered in the browser for URLs, looks
up their reputation ratings in the WOT backend, and shows the
results as color-coded glyphs. For example, Fig. 2(a) shows a
red glyph next to a website deemed unsafe by WOT. The rating
space is divided into five levels, with a color code assigned to
each level (see Fig. 2(b)). Clicking on the glyph brings up a
pop-up window that shows more information about the rating.
WOT’s confidence in a rating (which we believe correlates
with the number of users who had given input ratings) is
also indicated in this window, represented by a set of dark
figurines (up to five). WOT has seen well over 100 million⁷
downloads. It is also used by large-scale services like Facebook
and Mail.ru. It is reasonable to assume that the user base of
WOT and similar rating systems runs into tens of millions.

Our objective is to see if we can use information found on a 
hitherto unrated webpage to predict what rating it will receive. 
In this paper we use WOT as the target reputation system. 
However, our proposed method is generic and would work 
with any web reputation system. We therefore use existing 
WOT ratings as the ground truth, and apply a supervised 
learning-based algorithm for model learning. Moreover, instead 
of building a regression model, we formulate the web page 
reputation prediction as a binary classification task [4]. The 
binary classification approach divides the reputation ratings 
into two (coarse) groups by applying a suitable threshold on the 
reputation ratings. This approach helps to minimize the effect 
of subjective variations among users in their ratings. Given a
reputation rating $r \in [0, 100]$ of a URL, the class information
of the URL is computed using the simple rule given in Equation 1.


⁷http://mywot.net [Retrieved: April 21, 2015].


In our experiments (Section VI) we present results for $T_h = 40$,
while Appendix D contains results for $T_h = 60$.
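
The thresholding rule of Equation 1 can be sketched in a few lines of Python. This is only an illustration: the function name `rating_class` and the default threshold of 40 (one of the two values used in the experiments) are choices made here, not part of the original system.

```python
def rating_class(r: int, threshold: int = 40) -> str:
    """Map a reputation rating in [0, 100] to a binary class (Equation 1)."""
    if not 0 <= r <= 100:
        raise ValueError("rating must lie in [0, 100]")
    # Ratings strictly below the threshold are labelled bad.
    return "bad" if r < threshold else "good"


# Hypothetical ratings, used only to show the boundary behaviour.
labels = [rating_class(r) for r in (12, 39, 40, 87)]
```

A rating equal to the threshold falls into the good class, since Equation 1 uses a strict inequality.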


IV. LookAhead: Predicting Safety Ratings

The main objective of LookAhead is to extend the existing 
capabilities of a web reputation system, such as WOT, by 
predicting the eventual ratings of previously unrated websites. 
The predictive approach utilizes existing reputation ratings of 
a large number of webpages to learn a mapping function from 
various webpage features to a set of target classes, in our 
case, either good or bad (see Equation 1). Fig. 3 illustrates 
an overview of our web safety prediction approach combined 
with WOT. The LookAhead part, highlighted in the figure, is 
composed of a web crawler, a database, and a predictive model. 
The web crawler extracts various features from webpages
and stores them, along with reputation ratings in various
dimensions⁸, in a database. The predictive model is responsible
both for learning a classification model and for
predicting the web safety of unrated URLs.

The efficacy of LookAhead relies mainly on the quality
of the learned predictive model, as well as its generalizability
to unforeseen URLs. As in any supervised machine learning
approach, a precursor to model learning is a time-consuming
bootstrapping step [3]. A typical bootstrapping process
involves obtaining a suitable feature representation of the data,
in our case websites, and collecting accurate ground truth
labels. Often the feature extraction process is referred to as
feature engineering, as it relies on domain-specific expertise.
The optimal feature set is often dependent on the target classes
of interest and obtaining an optimal feature set has been
identified as a non-trivial problem [22]. In this work we
consider two types of features to represent websites: (i) structural
features, which are extracted from the HTML and embedded
JavaScript code, and (ii) content features that capture ratings
of web links embedded in the page and the thematic structure
of page text.


⁸Reputation ratings in trustworthiness and child safety are
obtained using API calls to the WOT system. For details see
https://www.mywot.com/wiki/API [Retrieved: April 21, 2015].


















A. Structural Features of Web Pages 

For structural features, we mainly rely on past research that
has identified and successfully validated a large set of features,
based on HTML and embedded JavaScript code, to identify
malicious webpages. Specifically, we adopt the handcrafted
and domain-specific features proposed by Canali et al. as
part of their Prophiler system [9]. In the evaluation section
(see Section V-B), we consider Prophiler as our main baseline
algorithm.

1) HTML-based Features: We adopt the same 20 HTML
features⁹ used by Prophiler. Examples include the number
of iframe tags, the number of hidden elements, the number
of script elements, the percentage of unknown tags, and the
number of malicious patterns, e.g., presence of the meta
tag [9].

2) JavaScript-based Features: We use the same 24
JavaScript-based features used by Prophiler, which are
extracted by analyzing either the JavaScript file or the <script>
element embedded within the HTML text. Examples of
JavaScript-based features include the number of times the
eval() function is used, the number of occurrences of the
setTimeout() and setInterval() functions, the number of DOM
modification functions, and the length of the script in
characters (see [9]).
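
To make the flavour of these structural features concrete, the following minimal sketch counts a handful of them with Python's standard `html.parser` module. It covers only a few of the examples named above and is not the Prophiler feature extractor. The class name `StructuralCounter`, the function `structural_features`, the feature names, and the simple test for hidden elements are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class StructuralCounter(HTMLParser):
    """Collects a few HTML- and JavaScript-based counts from one page."""

    def __init__(self):
        super().__init__()
        self.counts = {"iframes": 0, "scripts": 0, "hidden_elements": 0}
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe":
            self.counts["iframes"] += 1
        if tag == "script":
            self.counts["scripts"] += 1
            self._in_script = True
        # Assumed notion of "hidden": inline CSS hiding or type="hidden".
        style = (attrs.get("style") or "").replace(" ", "").lower()
        hidden_by_style = "display:none" in style or "visibility:hidden" in style
        if hidden_by_style or attrs.get("type") == "hidden":
            self.counts["hidden_elements"] += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Text inside <script> elements is kept for the JavaScript counts.
        if self._in_script:
            self.script_chunks.append(data)


def structural_features(html: str) -> dict:
    parser = StructuralCounter()
    parser.feed(html)
    parser.close()
    script = "".join(parser.script_chunks)
    features = dict(parser.counts)
    features["eval_calls"] = len(re.findall(r"\beval\s*\(", script))
    features["timer_calls"] = len(
        re.findall(r"\bset(?:Timeout|Interval)\s*\(", script)
    )
    features["script_length"] = len(script)
    return features


# Hypothetical page used only to exercise the counters.
page = """<html><body>
<iframe src="http://example.com/a" style="display: none"></iframe>
<script>eval("x"); setTimeout(f, 10); setInterval(g, 20);</script>
</body></html>"""
features = structural_features(page)
```

Counting with regular expressions is a static approximation: it does not execute the script and can be misled by obfuscated code.
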
Fig. 4. Overview of the embedded link-based feature extraction procedure.
For a given webpage (on the left), we identify the set of outgoing web
links (in the middle) present on the page and fetch the reputation ratings of
those embedded links. We use the reputation ratings to compute an empirical
cumulative distribution function (ECDF).


ECDF-based feature extraction has been previously
explored in the field of ubiquitous computing and mobile sensing
to represent human motion characteristics from continuous
accelerometer data streams [3], [19], [28]. However, the method
has attracted very little attention outside the sensing domain.
The simplicity and fast computation time of the ECDF features
make them a viable option for static web page analysis.
Contrary to mobile sensing, in this paper we primarily focus on
discrete reputation ratings. Fig. 4 shows a pictorial overview
of the basic idea of extracting ECDF-based features from a set
of embedded web links.


B. Content Features of Web Pages 

Contrary to the state-of-the-art approaches, in this paper 
we propose the use of a novel set of features based on 
(1) the empirical cumulative distribution function (ECDF) of the
reputation ratings of embedded forward links and (2) topic
modeling. The main intuition behind using these features is 
that by learning (unsupervised) webpage content properties, 
we avoid the need for handcrafted features based on domain 
knowledge. In our evaluation, we show that the proposed novel 
features improve the recognition performance significantly (see 
Section VI). 

1) Embedded Link-based Features: To extract simple yet
effective clues about the content of a web page, we hypothesize
that the content of a page is related to the content of the pages
it links to. In other words, we make use of the hypothesis:
“You are the company you keep”. This saying is based on the
fact that knowledge about an unknown person’s friends often
provides some idea about the person’s interests or personality.
Similar ideas have been successfully applied in recommender
systems [7] and in detecting susceptibility of mobile devices
to malware infections [35].

Building on this idea, we propose a feature extraction 
scheme utilizing the available reputation ratings of embedded 
links. However, web pages may contain an arbitrary number 
of embedded links; the number can vary between zero
and a few hundred. Moreover, the range
of the reputation ratings can be arbitrary. Thus we need a 
feature representation scheme that can compactly represent an 
arbitrary number of outgoing links, while remaining robust in 
the face of arbitrary ranges of ratings. 


⁹Readers are referred to [9] for an exhaustive and in-depth description of
all the HTML features.


More formally, let $R = \{r_1, r_2, \ldots, r_n\}$ denote the set of
available reputation ratings of all the embedded web links on
a page, where $r_i \in [0, 100]$, $\forall i \in \{1, \ldots, n\}$. The ECDF
$\mathcal{P}_c(r)$ of $R$ can be computed as below:

$$\mathcal{P}_c(r) = p(X \le r), \qquad (2)$$

where $p(X = r)$ is the probability of observing an embedded
web link with a reputation rating of $r$ and $X$ is a random
variable that takes values from $R$ (uniformly at random).
For example, Figure 5(a) shows an exemplary histogram of
reputation ratings of web links found within a web page and
Figure 5(b) shows the corresponding ECDF computed using
Equation 2. Note that $\mathcal{P}_c(r)$ is defined on the entire range
of the reputation ratings for embedded web links and is a
monotonically non-decreasing function.

Often the distribution of reputation ratings for embedded
links is multimodal, e.g., as in our example shown in
Figure 5(a). In order to learn from such distributions, a recognition
system should extract descriptors that relate to the shape and
spatial position of the modes [19]. The shape of the distribution
is captured as $\mathcal{P}_c$ increases from 0 to 1 (see Figure 5(b)). To
extract a feature vector $f \in \mathbb{R}^k$ from the distribution, we first
divide the range of $\mathcal{P}_c$, i.e., $[0, 1]$, into $k$ equally sized bins
with centers respectively at $[b_1, b_2, \ldots, b_k]$. The feature
component $f_i \in \mathbb{R}$ is then computed as:

$$f_i = \mathcal{P}_c^{-1}(b_i) \qquad (3)$$

Thus the feature vector $f$ accurately captures the shape and
positions from the underlying probability function $p(r)$, while
the ECDF $\mathcal{P}_c$ can be computed efficiently using the Kaplan-Meier
estimator [12]. For completeness, Figure 5(c) shows the
extracted ECDF-based feature vector for $k = 75$. The only
parameter for the ECDF-based feature extraction method is the
number of bins $k$, which controls the granularity with which
(a) Histogram showing the frequency distribution of reputation ratings of embedded web links found within a web page.
(b) Empirical Cumulative Distribution Function (ECDF) computed from the reputation ratings.
(c) The ECDF-based features by computing the inverse.

Fig. 5. Exemplary illustration of (a) the distribution of reputation ratings, (b) their Empirical Cumulative Distribution Function, and (c) ECDF-based features. 


the shape of the underlying distribution is captured. In our
experiments we also append the mean of ratings in $R$ as a
feature value to the extracted ECDF feature vector.

Adversarial Implications: If ECDF features were based on 
all outgoing links, a malicious website may attempt to evade 
detection by embedding a large number of links to pages with 
high ratings. To deter such an attack, while constructing the set 
R (see above), we only allow ratings r < Cr, where Cr is the 
critical rating threshold. The choice of Cr can be application 
dependent and ideally should be adapted based on the overall 
costs of making false negative predictions. 
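
The ECDF-based feature extraction, including the critical rating threshold and the appended mean, can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `ecdf_features` is hypothetical, the inverse ECDF is taken to be the generalized inverse (the smallest rating whose ECDF value reaches the bin center) rather than the Kaplan-Meier-based computation mentioned above, and a page with no usable ratings is reported as `None` to signal that this feature type is unavailable.

```python
from statistics import mean


def ecdf_features(ratings, k=5, critical_rating=None):
    """Return k inverse-ECDF values at the bin centers plus the mean rating."""
    if critical_rating is not None:
        # Adversarial safeguard: keep only ratings r < C_r.
        ratings = [r for r in ratings if r < critical_rating]
    if not ratings:
        return None  # feature type unavailable for this page
    ordered = sorted(ratings)
    n = len(ordered)
    features = []
    for i in range(1, k + 1):
        # Bin center b_i = (2i - 1) / (2k). The generalized inverse picks the
        # smallest rating r with ECDF(r) >= b_i, i.e. index ceil(b_i * n) - 1.
        # Integer arithmetic avoids floating-point rounding in the ceiling.
        idx = -(-(2 * i - 1) * n // (2 * k)) - 1
        features.append(ordered[idx])
    features.append(mean(ordered))
    return features


# Hypothetical ratings of the links embedded in one page.
link_ratings = [10, 20, 20, 80, 90]
all_links = ecdf_features(link_ratings, k=5)
low_only = ecdf_features(link_ratings, k=5, critical_rating=40)
```

With the threshold applied, links rated 80 and 90 no longer influence the feature vector, so padding a page with links to highly rated sites leaves the features unchanged.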

2) Topic Model-based Features: To gain further insight
into the type of content on a web page, we analyze the
text in the page and extract a set of features that captures
the distribution over a set of topics. A topic is defined as a
probability distribution over a fixed set of words. In order to
learn the topics in an unsupervised manner, we employ the well
established Latent Dirichlet Allocation (LDA) model [6]. The
main objective of LDA, or in general of any topic modeling
algorithm, is to extract short descriptions of documents, while
preserving statistical relationships that are useful, e.g., for
document summarization and classification. In this work, we
only focus on text in English. As a significant portion of the
webpages in our evaluation dataset (see Section V-A) is
non-English, we use Google translation APIs, as part of the web
crawler, to convert text into English. To avoid translation errors,
we use an English dictionary to validate words before they are
included in the vocabulary set V used by the LDA model.

The main objective of the LDA model is to learn model
parameters, such as the $K$ topics $\beta_{1:K}$, the topic proportions $\theta_d$
in the document $d$, and the topic assignments $z_{n,d}$ of the observed
word $w_n$ in document $d$ from the corpus of webpages. A
brief overview of the topic model and the definitions of the
parameters are given in Appendix A. Once the LDA parameters
are learned, given the set of words $w$ present on a webpage
and the topics $\beta_{1:K}$, the topic model-based feature set for the
webpage is computed as $p(\theta_d \mid w, \beta_{1:K})$, i.e., the estimated
topic proportions.

Adversarial Implications: Similarly to the ECDF-based features,
the topic model-based features can be exploited by an
adversary. As the topic proportion term $p(\theta_d \mid w, \beta_{1:K})$ captures
the relative weight of various topics being described within the
text $w$, an attacker can simply add random words that can boost
the probability of certain topics. In Section VII we propose a
possible solution to prevent this attack.

C. Ensemble Classification 

One challenge in the feature extraction procedures
described above is that often one or more feature types are
missing from a web page. For example, in reality, not all web
pages use JavaScript, contain embedded forward links, or use
textual descriptions, although the HTML features are always
available. Thus, a new classification technique is required that
is able to overcome the problem of feature unavailability.
Existing approaches such as [9], [23], [33], [34] do not address
this problem and therefore have limited generalizability.

According to Bayesian theory [20], given HTML ($f_H$),
JavaScript ($f_J$), ECDF ($f_E$), and Topic ($f_T$) feature vectors,
a URL should be assigned to the class $c_j \in \{\text{bad}, \text{good}\}$ if
the posterior probability for class $c_j$ is maximum, i.e.,

$$\text{assign URL} \rightarrow c_j \ \text{ if } \ p(c_j \mid f_H, f_J, f_E, f_T) = \max_{i} \, p(c_i \mid f_H, f_J, f_E, f_T) \qquad (4)$$

The computation of $p(c_j \mid f_H, f_J, f_E, f_T)$ depends on the
joint probability function (likelihood) $p(f_H, f_J, f_E, f_T \mid c_j)$
and the prior probability $p(c_j)$, i.e.:

$$p(c_j \mid f_H, f_J, f_E, f_T) \propto p(f_H, f_J, f_E, f_T \mid c_j) \, p(c_j) \qquad (5)$$

The likelihoods are difficult to infer when one or more features
are unavailable. The likelihood computation can be simplified
by combining decision support of individual classifiers
on different feature types [20]. Accordingly, we train four
classifiers $C_H$, $C_J$, $C_E$, and $C_T$ using valid $f_H$, $f_J$, $f_E$, and $f_T$
features respectively, where each classifier returns a posterior
probability distribution over the bad and good classes. However,
during prediction, if a feature type is unavailable, we do
not include the corresponding classifier while computing the
overall posterior probabilities.

A number of strategies can be adopted to combine the
posterior probabilities of the classifiers to generate the overall
belief. In this paper we propose a linear combination rule that
determines the combination weights of individual classifiers
using the Fukunaga class separability score [18]. Our adaptive
weight selection method is based on the intuition that a classifier
should be given more importance if it is easy to separate










Fig. 6. Overview of the ensemble classification approach used by LookAhead. 


among the bad and good classes in the corresponding feature
space. See Appendix B for the definition of class separability used
in this work and other popular combination rules. For each
classifier, we compute the separability score after correlation-based
feature subset selection. The separability scores, after
normalization, are then used as the respective weight $w_k$ for
the classifier $C_k$. The final belief of the class $c_j$ is estimated
as:

$$p^*(c_j \mid f_H, f_J, f_E, f_T) = \sum_{k \in \{H, J, E, T\}} w_k \, p(c_j \mid f_k) \qquad (6)$$

The final predicted class $c_j$ is inferred by applying the decision
rule given in Equation 4 using the computed belief above.
Fig. 6 shows the data-adaptive ensemble classification technique
used by LookAhead.
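
The weighted combination of Equation 6, together with the rule that unavailable feature types are left out, can be sketched as below. The function name `combine_posteriors`, the example weights, and the example posteriors are hypothetical. Renormalizing the weights over the available classifiers is an assumption made here so that the belief sums to one; it does not change which class attains the maximum. The separability-based computation of the weights (Appendix B) is not shown.

```python
def combine_posteriors(posteriors, weights):
    """Linearly combine per-feature-type posteriors into an overall belief.

    posteriors: feature type -> {"bad": p, "good": 1 - p}, or None when the
                feature type could not be extracted for the page.
    weights:    feature type -> normalized separability weight w_k.
    """
    available = {k: p for k, p in posteriors.items() if p is not None}
    if not available:
        raise ValueError("no feature type available for this page")
    # Assumption: rescale weights so the available ones sum to one.
    total = sum(weights[k] for k in available)
    belief = {"bad": 0.0, "good": 0.0}
    for k, p in available.items():
        for c in belief:
            belief[c] += weights[k] / total * p[c]
    # Decision rule of Equation 4; an exact tie resolves to "bad" because it
    # is listed first in the belief dictionary.
    predicted = max(belief, key=belief.get)
    return predicted, belief


# Hypothetical weights and classifier outputs; JavaScript is unavailable.
weights = {"H": 0.2, "J": 0.1, "E": 0.4, "T": 0.3}
posteriors = {
    "H": {"bad": 0.4, "good": 0.6},
    "J": None,
    "E": {"bad": 0.8, "good": 0.2},
    "T": {"bad": 0.5, "good": 0.5},
}
predicted, belief = combine_posteriors(posteriors, weights)
```

In this example the ECDF classifier carries the largest weight, so its confident bad vote outweighs the HTML classifier's mild good vote.
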

V. Experimental Settings

A. Datasets

To perform an extensive and systematic study, we generated
a pool of over 140,000 URLs and obtained their reputation
ratings in both dimensions using the WOT API. Out of these,
80,000 URLs have positive reputation ratings, and 60,000
URLs have negative ratings. For each URL we crawl the web
page to extract HTML, JavaScript, ECDF and topic model
features where available. Figure 7 illustrates the histograms of
reputation ratings for all webpages in our dataset. The dataset,
where at least valid HTML features and WOT ratings are
available, is referred to as the opportunistic dataset. Out of 140,000
URLs, 89,220 web pages have trustworthiness ratings, and
84,714 web pages have ratings for child safety. However, the
number drops to 31,995, in case of trustworthiness, and to
38,118 for child safety, when validity of all feature types is
considered (for $T_h = 40$). We refer to this second dataset as
the all-valid dataset. The significant drop in the size of the
all-valid dataset further highlights that feature unavailability is
intrinsic to web data analysis.

Fig. 7. Histograms of all webpages in our dataset in two reputation
dimensions: trustworthiness (top) and child safety (bottom).

Existing research has primarily focused on detecting if a
webpage is malicious. However, the malware dataset used
in [9] is no longer available, which makes exact replication
of Prophiler results difficult. The definition of trustworthiness
does not directly correspond with malware. In addition to the
reputation ratings, WOT provides category information, such
as ‘malware’, ‘scam’, ‘suspicious’ and ‘good site’, of websites
based on votes from users and third parties. We filter the
all-valid dataset and generate a malware dataset consisting of
2,784 webpages that were categorized by WOT as ‘malware
or virus’ and contain all feature types. To generate a dataset
containing both malware and benign webpages, we include an
equal number of webpages that received very high trustworthiness
ratings and have all feature types. We refer to this dataset as
the malware dataset.

Lastly, we construct another dataset by considering only
the URLs that fall either in the topmost or the bottommost
trustworthiness rating categories; see Figure 2(b) for definitions
of the various rating categories used by WOT. As with the malware
dataset, we only consider webpages for which all feature types
are available. Our two-category dataset consists of 10,118 very
poor and 13,539 excellent webpages.


B. Baseline Algorithms 

In our experiments, we report comparison results against
Prophiler [9]. Prophiler relies on HTML, JavaScript, and
URL/HOST features to detect if a webpage is malicious. However,
it uses APIs to a proprietary WHOIS [15] system and
a private database of blacklisted URLs to extract URL/HOST
features. Neither of these is openly available, which makes
the corresponding URL/HOST feature vectors invalid for our
datasets. Consequently, we do not use URL/HOST features in
our ensemble classification system.

Note that it is very easy to incorporate additional feature
types into our classification system, e.g., by training a classifier C
using the new feature type and then considering its posterior
probabilities in Equation 14. In contrast to our approach of
assigning data driven weights to the classifiers to compute the
final belief (see Section IV-C), Prophiler uses the ‘OR’ combination
rule (see Appendix C). We systematically compare the
performance of LookAhead with the ensemble classification
methodology considering different subsets of feature types.
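
To make the difference between the combination rules concrete, the sketch below applies the Sum, Product, OR and Majority Voting rules of Appendix C to per-feature-type posteriors for the "bad" class. It is an illustration under assumptions: the 0.5 decision threshold, the two-class setting and the `weighted_sum_rule` helper (a stand-in for a data driven weighting, with hypothetical weights) are not taken from the paper.

```python
import random
from math import prod


def sum_rule(p_bad):
    """Predict 'bad' if the summed posterior for bad exceeds that for good."""
    bad = sum(p_bad.values())
    good = sum(1.0 - p for p in p_bad.values())
    return bad > good


def product_rule(p_bad):
    """Predict 'bad' if the product of posteriors for bad is the larger one."""
    return prod(p_bad.values()) > prod(1.0 - p for p in p_bad.values())


def or_rule(p_bad, threshold=0.5):
    """Predict 'bad' as soon as any single classifier predicts bad."""
    return any(p > threshold for p in p_bad.values())


def majority_vote(p_bad, threshold=0.5, rng=random):
    """Majority of the individual decisions; ties are broken randomly."""
    votes_bad = sum(p > threshold for p in p_bad.values())
    votes_good = len(p_bad) - votes_bad
    if votes_bad == votes_good:
        return rng.random() < 0.5
    return votes_bad > votes_good


def weighted_sum_rule(p_bad, weights):
    """Hypothetical weighted variant: weights are assumed, not the paper's."""
    total = sum(weights[k] for k in p_bad)
    score = sum(weights[k] * p for k, p in p_bad.items()) / total
    return score > 0.5
```

For example, with posteriors `{"H": 0.3, "J": 0.2, "E": 0.7, "T": 0.4}` only the OR rule flags the page as bad, which illustrates why that rule trades a lower FNR for a higher FPR. Because each function iterates over whatever keys are present, the same code also covers pages for which only some feature types are available.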

C. Evaluation Metric 

We use 10-fold cross validation when presenting classification
performance for all the approaches. As the primary
performance metrics, we use Avg. F1-score, FNR, and FPR.
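
As a rough illustration of how these three metrics relate to a confusion matrix, the sketch below treats "bad" as the positive class and weights the per-class F1-scores by class size. The function names are illustrative, and FPR is taken here as the share of good pages that are flagged as bad.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall), written via counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(tp, fp, fn, tn):
    """Return (avg_f1, fnr, fpr) with 'bad' as the positive class."""
    f1_bad = f1_score(tp, fp, fn)
    # For the good class the roles swap: TN are its hits, FN its false alarms.
    f1_good = f1_score(tn, fn, fp)
    n_bad, n_good = tp + fn, tn + fp
    avg_f1 = (n_bad * f1_bad + n_good * f1_good) / (n_bad + n_good)
    fnr = fn / (fn + tp)   # bad pages that were missed
    fpr = fp / (fp + tn)   # good pages that were flagged as bad
    return avg_f1, fnr, fpr
```

For instance, `evaluate(tp=80, fp=10, fn=20, tn=90)` gives an FNR of 0.2, an FPR of 0.1 and a weighted average F1-score of about 0.85.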

























TABLE I. Performance of LookAhead (under various feature
combinations) and Prophiler on the all-valid dataset.

All-valid dataset size: 31,995 URLs, T_h = 40, C_r = 40
Reputation dimension: Trustworthiness

All-valid dataset size: 38,118 URLs, T_h = 40, C_r = 40
Reputation dimension: Child safety

**: Statistically significant with 99% confidence.


The definitions of all the evaluation metrics are given in 
Appendix D. 

VI. Evaluation 

We begin our evaluation by first considering classification
performance on the all-valid dataset using Random Forest as
the basic classifier^10. Note that all URLs considered within
this dataset have valid HTML (H), JavaScript (J), ECDF (E)
and Topic-based (T) features. This dataset allows us to
systematically study the influence of various feature combinations on
the overall classification performance of LookAhead. Table I
summarizes the performance of LookAhead in both reputation
dimensions with the parametric settings T_h = 40 (see
Equation 1) and C_r = 40 (see Section IV-B). Additionally, the
table also includes the performance of Prophiler.

For trustworthiness, LookAhead achieves the highest Avg.
F1-score of 81.3% when all feature types are considered
(highlighted in gray), at the same time achieving the lowest
FNR (19%) and FPR (18.3%). Similarly for child safety,
LookAhead with all feature types achieves the best Avg.
F1-score (86.4%), the lowest FNR (11.6%) and the lowest FPR (16.2%).

^10 We also experimented using linear-SVM, SVM, KNN and C4.5 classifiers,
and chose Random Forest for its superior performance.


TABLE II. Performance of LookAhead on the opportunistic
dataset under various classifier combination rules.

Opportunistic dataset size: 89,220 URLs, T_h = 40 and C_r = 40
Reputation dimension: Trustworthiness

Comb. Rule | Balancing | Avg. F1-Score (%) | FNR (%) | FPR (%)
Adaptive   | -         | 78.0              | 56.4    | 4.1
Sum        | -         | 77.8              | 57.9    | 3.5
Product    | -         | 78.9 **           | 53.3    | 4.6
Or         | -         | 32.4 **           | 10.1    | 84.2
Voting     | -         | 71.4 **           | 38.8    | 25.2
Prophiler* | -         | 72.6 **           | 45.3    | 19.8
Adaptive   | ✓         | 74.0              | 22.3    | 29.1
Sum        | ✓         | 74.0              | 22.3    | 29.0
Product    | ✓         | 73.6 **           | 22.8    | 29.5
Or         | ✓         | 27.3 **           | 2.3     | 89.8
Voting     | ✓         | 57.3 **           | 11.1    | 57.0
Prophiler* | ✓         | 62.0 **           | 14.4    | 49.8

Opportunistic dataset size: 84,714 URLs, T_h = 40 and C_r = 40
Reputation dimension: Child safety

Comb. Rule | Balancing | Avg. F1-Score (%) | FNR (%) | FPR (%)
Adaptive   | -         | 83.7              | 29.8    | 7.2
Sum        | -         | 83.4 *            | 31.2    | 6.6
Product    | -         | 83.6              | 29.6    | 7.5
Or         | -         | 40.7 **           | 4.7     | 82.3
Voting     | -         | 73.9 **           | 19.7    | 30.7
Prophiler* | -         | 73.9 **           | 26.7    | 26.2
Adaptive   | ✓         | 81.5              | 25.3    | 14.0
Sum        | ✓         | 81.1 **           | 22.9    | 16.4
Product    | ✓         | 80.9 **           | 23.1    | 16.8
Or         | ✓         | 40.0 **           | 3.0     | 83.4
Voting     | ✓         | 68.1 **           | 12.3    | 44.3
Prophiler* | ✓         | 69.4 **           | 15.6    | 40.5

**: Statistically significant with 99% confidence.
* : Statistically significant with 95% confidence.


In both reputation dimensions, the performance using all
features is significantly better (statistically) than all other
feature combinations, i.e., p ≤ 0.01 in McNemar test with
Yates’ correction [24].
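
For readers unfamiliar with the test, the following is a minimal sketch of McNemar's test with Yates' continuity correction for comparing two classifiers on the same test samples. The argument names `b` and `c` are illustrative: they count the samples that only the first, respectively only the second, classifier labels correctly. The p-value uses the closed form of the chi-square survival function with one degree of freedom.

```python
from math import erfc, sqrt


def mcnemar_yates(b, c):
    """McNemar's test with Yates' continuity correction.

    b: samples classified correctly by classifier A only
    c: samples classified correctly by classifier B only
    Returns (chi2 statistic, two-sided p-value).
    """
    if b + c == 0:
        # The classifiers never disagree, so there is no evidence of a difference.
        return 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value
```

With `b = 30` and `c = 10` the statistic is 9.025 and the p-value is below 0.01, so the difference would be reported as significant at the 99% level.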

Prophiler shows a statistically weaker classification performance
in both reputation dimensions compared to LookAhead
(employing all feature types). However, it achieves a better
FNR in prediction than LookAhead. This is due to the use
of a conservative ‘OR’ classifier combination rule (see
Appendix C) that is more likely to report a URL as bad. This
higher likelihood of predicting webpages as bad improves
the overall recall of the bad class, which consequently pulls
down the FNR for Prophiler, albeit at the expense of a
higher FPR. Prophiler focuses solely on reducing FNR. In
contrast, in use cases where overall usability in prediction is
important, both FNR and FPR should be reduced. For example,
in predicting safety ratings, a low FPR is also needed to avoid
showing frequent warnings to users for websites that are actually good.
A very similar classification performance (Table VI in the
Appendix) is observed when the same set of experiments is
conducted on the all-valid dataset for T_h = 60 and C_r = 60.

In reality, not all feature types are available for all URLs. 
To evaluate the performance of LookAhead under real life 
situations we next present results on the opportunistic dataset. 
In these experiments, we only present the performance of 
LookAhead while considering all available feature types. 
Moreover, we study the performance of various classifier 
combination rules and present the results in Table II for both 
reputation dimensions. In contrast to the all-valid dataset,
T_h = 40 generates a high degree of class imbalance in




TABLE III. Performance of LookAhead and Prophiler on the
malware and two-category datasets.

Malware dataset size: 5,568 URLs

System    | Feature sets | Avg. F1-Score (%) | FNR (%) | FPR (%)
LookAhead | H, J, E, T   | 89.0              | 10.3    | 11.6
Prophiler |              | 80.7 **           | 11.1    | [illegible]

Two-category dataset size: 23,657 URLs

System    | Feature sets | Avg. F1-Score (%) | FNR (%) | FPR (%)
LookAhead | H, J, E, T   | 89.8              | 13.9    | 7.4
Prophiler |              | 79.3 **           | 16.2    | 24.3

**: Statistically significant with 99% confidence.


our opportunistic dataset (see Figure 7). During the training
phase the prevalence of one class affects the process of
learning, and the learned classifier is often biased towards
the over-represented class [25]. To mitigate class imbalances
during training, we also report experimental results when a
simple class balancing approach, i.e., reducing data from the
prevalent class, is applied during classifier training. The data
driven, adaptive classifier combination rule of LookAhead
generates the best classification performance, with a notable
exception in the case of the unbalanced dataset for trustworthiness,
where the ‘Product’ rule achieves the highest Avg. F1-Score.
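
The class balancing step described above (reducing data from the prevalent class) can be sketched as random undersampling. This is an illustrative version: the function name, the fixed seed and the choice to shrink every class to the size of the smallest one are assumptions, not details given in the paper.

```python
import random


def undersample(samples, labels, seed=0):
    """Randomly drop samples so that every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Keep n_min randomly chosen samples of each class.
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    rng.shuffle(balanced)
    return balanced
```

Balancing would be applied to the training folds only, so that the test folds keep the class distribution of the original data.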

Prophiler has previously been shown to perform well in
detecting malicious websites. To show how LookAhead (with
all features) performs in such scenarios, we repeated the
experiments on the malware and two-category datasets and present
the results in Table III. LookAhead achieves average F1-scores
of 89% for the malware dataset and 89.8% for the
two-category dataset, which are significantly better (p ≤ 0.01) than
Prophiler’s performance of 80.7% for the malware dataset and
79.3% for the two-category dataset. LookAhead also generates
better FNR and FPR than Prophiler in both cases.

VII. Discussion 

A. Feature Importance in Reputation Prediction 

Our results show that the structural and content related
properties of a website can be effectively used to predict not
only its maliciousness, but also the more challenging properties
of trustworthiness and child safety. In order to understand
the overall classification results, we study the importance^11 of
individual features as computed by a Random Forest classifier.
In Fig. 8 we plot the average^12 importance for all (120) features
used in this work when training a Random Forest classifier
(using 100 trees) on the all-valid dataset (T_h = 40, C_r = 40).
The higher the value, the more important the feature is. Fig. 8
further highlights that different features are assigned different
relative importances while separating good websites from bad
ones in each reputation dimension.
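
The importance measure used here is footnoted as the total decrease in node impurity averaged over all trees. As a small illustration of the quantity being accumulated, the sketch below computes the impurity decrease of a single split. Gini impurity is assumed as the impurity measure; the paper does not state which one was used.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed impurity measure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted child impurities."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))
```

A feature's importance in one tree is then the sum of such decreases over all nodes that split on that feature, and the forest-level score averages this over the trees.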

Interestingly, the importance scores of the HTML and
JavaScript-based features look very similar for both
trustworthiness and child safety predictions. The most important
features, shown by the dotted region A in the figure, are related

^11 Feature importance is defined as the total decrease in node impurity
averaged over all the trees [8].

^12 Over 10 folds.


[Figure 8: per-feature importance plots; panel title: Trustworthiness; x-axis: Features (ticks 20 to 120).]

Fig. 8. Importance of individual features, while predicting trustworthiness
and child safety, computed by the Random Forest classifier on the all-valid
dataset.


to script tags in HTML, direct assignments in JavaScript, and
the total character count in both. Although a few structural (i.e.,
HTML and JavaScript) features are found to be important, a
majority of them have little or no significance. In contrast to the
structural features, ECDF features show significant differences
in importance scores for the two reputation dimensions. For
trustworthiness, low ratings of the embedded forward links
(region B) play an important role in prediction. In child safety,
the mean value of the embedded ratings (region C) also plays
a significant role. For trustworthiness, the three most
important topics (region E) are related to money-making, news,
and weather. Among the rest of the topic features, none are
significantly better or worse than the others. For child safety
prediction there are three other topics (region D) that play a
significant role and, as expected, these topics correspond to
adult content.

Although we use the same feature set for predicting
both reputation dimensions, the feature selection inherent to
the Random Forest classifier learns very different mapping
functions for each prediction task. Fig. 8 provides evidence
that our proposed ECDF and Topic-based features contribute
consistently in predicting subjective ratings of web pages.

B. Tuning of Prediction Performance 

Predictive performance of LookAhead can be primarily
influenced by a number of factors: (i) the type of features
considered (e.g., HTML and ECDF), (ii) the type of classifier used
to train on individual feature dimensions (e.g., Random Forest
and SVM), (iii) strategies used to overcome class imbalances
in the training data, and (iv) the combination rule used for
computing the final posterior probability (e.g., Adaptive and
Sum rule). Often, once the prediction pipeline is deployed, the
factors (i)-(iii) are kept constant, as they are time consuming
to re-build. However, the classifier combination strategy can be
adapted in real time to control the overall performance of the
LookAhead system. Based on the requirements, the system
administrator can focus more on lowering the overall FNR
by using the ‘OR’ combination rule, e.g., while predicting
child safety a very low FNR is expected for parental filtering
systems. As evident from Table II, often emphasizing FNR



































































TABLE IV. Detection rates for various classifier settings.

Dataset           | System    | FNR (%) | FPR (%) | Detection rate D_r (%)
All-valid, TW     | LookAhead | 19.0    | 18.3    | 52.5
All-valid, TW     | Prophiler | 14.2    | 35.9    | 37.4
All-valid, CS     | LookAhead | 11.6    | 16.2    | 57.5
All-valid, CS     | Prophiler | 9.6     | 34.5    | 39.6
Opportunistic, TW | LookAhead | 22.3    | 29.1    | 40.0
Opportunistic, TW | Prophiler | 14.4    | 49.8    | 30.1
Opportunistic, CS | LookAhead | 25.3    | 14.0    | 51.2
Opportunistic, CS | Prophiler | 15.6    | 40.5    | 34.3

inflates FPR. Our LookAhead system demonstrates a good 
balance of both FNR and FPR. 


C. Detection Rate 


When considering the implications of our results, in
addition to the FNR and FPR, the proportion of good and bad
websites in the wild should also be taken into account. In
reality, this so-called base rate B_r is biased towards good
websites. Thus we look at the detection rate for bad websites,
i.e., what percentage of the webpages that our classifier predicts
as bad are truly bad. From WOT statistics [31], we see that
roughly 20% of websites that have a rating are dangerous
regarding either trustworthiness or child safety. We use this
number as our estimate for B_r, and compute the detection
rate D_r as:

$$ D_r = \frac{(1 - FNR) \cdot B_r}{(1 - FNR) \cdot B_r + FPR \cdot (1 - B_r)} \tag{7} $$


Table IV presents detection rates for our different classification
scenarios. The datasets are presented for both
trustworthiness (TW) and child safety (CS). We can see that
due to the biased base rate, the detection rates are in the
range of 30-40% for Prophiler and 40-57% for LookAhead.
Still, the better classification performance of LookAhead over
Prophiler is apparent here too. For example, if we consider
a warning system for users, we can see that, in the case of
all features being present, 52.5% of possible warnings for
untrustworthy webpages would be correct for LookAhead,
with the rest being false alarms. The corresponding detection
rate of 37.4% for Prophiler is significantly lower. This shows
that while in general the problem of predicting a reputation
rating is challenging due in part to the biased base rate,
considering content-based features significantly increases the
detection rate.
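
The detection-rate computation can be checked with a few lines of code. The sketch below assumes the base rate of 20% quoted above and takes FNR and FPR as fractions; the function name is illustrative.

```python
def detection_rate(fnr, fpr, base_rate=0.20):
    """Share of pages predicted as bad that are truly bad, given a base rate."""
    true_alarms = (1.0 - fnr) * base_rate      # bad pages that are flagged
    false_alarms = fpr * (1.0 - base_rate)     # good pages that are flagged
    return true_alarms / (true_alarms + false_alarms)
```

Plugging in the all-valid trustworthiness figures, `detection_rate(0.19, 0.183)` evaluates to roughly 0.525 for LookAhead and `detection_rate(0.142, 0.359)` to roughly 0.374 for Prophiler, in line with the 52.5% and 37.4% discussed above.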


D. Applications 

We see two potential uses for LookAhead: 

• fast-tracking publication of ratings: Crowdsourced 
reputation rating services like WOT do not announce 
a rating for a web site until they have enough input 
ratings to reach a sufficient level of confidence. If 
a partially accumulated rating (that has not reached 
a sufficient level of confidence) matches the rating 
predicted by our classifier, the reputation service may 
choose to fast-track the publication of the rating. 

• intermediate user feedback: If a user attempts to 
navigate to an unrated page that is predicted by our 
classifier to have a potentially bad rating, the browser 
extension can warn the user accordingly. 


TABLE V. Time Analysis for Fetching Various Features.

Feature type                | Average fetch time
HTML + JavaScript           | 3.1 s / link
ECDF                        | 1.9 s / link
Topic + translation         | 3.4 s / link
Topic + without translation | 1.3 s / link


Earlier research [26] raised concerns about the usefulness of
crowdsourcing for security and privacy applications. Nevertheless,
given the popularity of systems like WOT, we argue that
a tool like LookAhead is essential for the security of users
who have chosen to rely on such systems. Also, note that
although our analysis was done with WOT as the target rating
system, the methodology is applicable to any website safety
rating system, whether crowdsourced or expert-rated.


E. Performance considerations 

We summarize the performance of our various feature
extraction techniques and report the average measured running
time needed for computing them. To compute
the average extraction time we randomly selected 1,000 URLs
from our dataset and measured the time required to extract
the different classes of features for the corresponding pages
on a standard Linux desktop computer (8 GB RAM, 2.4 GHz
processor). In the case of Topic model features we also
recorded the time for performing translation of non-English
web pages. Table V summarizes the time analysis of our
feature extraction methods. The time of 3.1 s that LookAhead
needs for extracting structural features is comparable to that of
3.06 s reported by Prophiler. When including the content based
features, LookAhead needs 6.3 s in total to extract all features
from an English-language web page (and 8.4 s if translation
is needed).
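
The totals quoted above follow from adding up the per-feature fetch times in Table V. A minimal worked check, assuming the feature types are extracted one after another (the dictionary keys are illustrative names):

```python
# Average fetch times in seconds per link, as listed in Table V.
FETCH_SECONDS = {
    "html_js": 3.1,
    "ecdf": 1.9,
    "topic_translated": 3.4,
    "topic_plain": 1.3,
}


def total_extraction_time(needs_translation):
    """Total time to extract all feature types for one page, in seconds."""
    topic = "topic_translated" if needs_translation else "topic_plain"
    total = FETCH_SECONDS["html_js"] + FETCH_SECONDS["ecdf"] + FETCH_SECONDS[topic]
    return round(total, 1)
```

This gives 6.3 s for an English-language page and 8.4 s when translation is needed.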


F. Limitations 

Perhaps the most significant limitation of any system using
machine learning to detect bad websites is the potential for
adversaries to manipulate the system: either by modifying their
website to avoid detection or by manipulating the classifier
itself. While the use of the ECDF function protects against
manipulation of outgoing links, as we pointed out in
Section IV-B2, the simplistic approach of using topic modeling is
vulnerable to an attacker who attempts to influence the inferred
topic model for a page he controls. Instead of directly using the
probability distribution of topics as we do in Section IV-B2,
we could convert it to a boolean vector (indicating if the topic is
present on the page). Such an approach will reduce FNs (since
an attacker can no longer gain by adding text to his page to
make it appear to belong to an innocuous topic as the dominant
topic), but will also raise FPs. We are currently investigating
this avenue.
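
The boolean-vector idea can be sketched as a simple thresholding of the topic proportions. The threshold value of 0.1 below is a hypothetical choice for illustration; the paper does not specify how topic presence would be decided.

```python
def topics_present(topic_proportions, threshold=0.1):
    """Map a topic distribution to a boolean presence vector.

    threshold is an assumed cut-off: a topic counts as present when its
    proportion reaches it, regardless of which topic is dominant.
    """
    return [p >= threshold for p in topic_proportions]
```

For a page with proportions `[0.7, 0.25, 0.05]`, the first two topics are marked as present. Padding the page with innocuous text can lower the proportion of a sensitive topic, but the topic stays flagged as long as it remains above the threshold.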

Another limitation is that, although the performance of
LookAhead is comparable to previous solutions, real time use
will require further speedup. One option here is to use
server-side assisted feature extraction. Finally, an open question is
how the use of predicted ratings will influence the actual rating.
For example, if the predicted rating is used for intermediate
user feedback as suggested above, it might sway future input
ratings from the crowd towards the predicted rating.




G. Current Work 

We are conducting a longitudinal study on a large number 
of websites that do not yet have a WOT rating. We plan to 
see (a) how well our predictions match the ratings of those websites
that do eventually get a rating and (b) how our predictions as well
as the actual ratings evolve over time.

Appendix


A. Topic Modeling 



Fig. 9. Graphical model representing the LDA used to analyze text from
web pages. The boxes represent replications, the shaded circle is the observed
variable (word) and unshaded circles are unobserved variables.


A graphical representation of the model that we use in our
analysis of web texts is shown in Figure 9. LDA follows a
probabilistic generative modeling approach using the
bag-of-words assumption, i.e., the order of occurrence of words within
a document is ignored. Under the LDA model, each web page
or document $d \in D$ is represented as a mixture of $K$ latent
(unobserved) topics denoted by the multinomial variable $\theta_d$
(topic proportion of the document), where each topic $\beta_k$
is a distribution over the set of words or vocabulary $V$ and
is sampled from a Dirichlet distribution with parameter $\eta$. $\theta_d$
is also drawn from a Dirichlet distribution with parameter $\alpha$.
$z_{n,d}$ denotes the topic assignment for the observed word $w_{n,d}$,
which is sampled from $\theta_d$. Each word $w_{n,d}$ depends on the
topic-assignment variable $z_{n,d}$ and all the topic distributions
$\{\beta_k\}_{k=1}^{K}$. The joint probability distribution can be written
as [5]:

$$ p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N} p(z_{n,d} \mid \theta_d) \, p(w_{n,d} \mid z_{n,d}, \beta_{1:K}) \right) \tag{8} $$

The main task of the topic model is thus to infer the
parameters $\beta_k$, $\theta_d$ and $z_{n,d}$ from the corpus of text, i.e., $w_{1:D}$.
The posterior distribution can be written as:

$$ p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})} \tag{9} $$

Although there are variational algorithms proposed in the
literature for estimating the posterior distribution given in
Equation 9, we use a Gibbs sampling-based approach to
efficiently approximate it. Once the model parameters are
estimated, for a given web page or document d containing
the set of words^13 w, we use $p(\theta_d \mid w, \beta_{1:K})$, i.e., the estimated
topic proportion, as the feature set for capturing the thematic
content of the document.
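
To illustrate how topic proportions can be estimated by Gibbs sampling, the sketch below implements a small collapsed Gibbs sampler for LDA in plain Python. It is a generic textbook-style sampler written for illustration, not the implementation used in the paper; the hyperparameter defaults, the iteration count and the word-id document encoding are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns theta, the estimated topic proportions of every document.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # counts n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # counts n_{k,w}
    topic_total = [0] * num_topics                               # counts n_k
    assign = []
    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assign[d][n]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this word (unnormalised).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + eta) / (topic_total[j] + vocab_size * eta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic proportions per document; each row sums to one.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return theta
```

Each row of the returned `theta` plays the role of the topic-proportion feature vector for one page. A production system would more likely rely on an established topic modeling library than on a hand-written sampler.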


With topic modeling we gain insights into the basic
composition of the web content present in our opportunistic
dataset. The most frequent topics encountered were related to
commenting and sharing, adult content, financial topics, and
gaming. The most frequently occurring words are visualized
in Fig. 10, which is created using the online tool Voyant [36].
In the figure the size of each word reflects its relative
number of occurrences in our sample set of web pages.

Fig. 10. Textual summary of the most frequent words present in our
opportunistic dataset, where the size indicates the frequency of the words
in the web text.



Fig. 11. Average classifier combination weights used by LookAhead on the 
all-valid dataset. 


B. Fukunaga Class Separability

The Fukunaga score relies on computing both the within-class
and between-class scatter matrices. Let $n_i$ be the number
of samples of the feature set belonging to class $c_i$ of $C$ different
classes. Further, let $\mu_i$ and $\mu$ respectively be the mean for class
$c_i$ and the global mean. The within-class and between-class
scatter matrices $S_W$ and $S_B$ can be computed as follows:

$$ S_W = \sum_{i=1}^{C} \sum_{x \in c_i} (x - \mu_i)(x - \mu_i)^T \tag{10} $$

$$ S_B = \sum_{i=1}^{C} n_i (\mu_i - \mu)(\mu_i - \mu)^T \tag{11} $$

Then the Fukunaga class separability score is computed as:

$$ \text{Separability} = \mathrm{Tr}\{ S_B / S_W \} \tag{12} $$


To illustrate the effect of the classifier combination strategy,
Fig. 11 shows the average (computed over 10 folds) relative
weights, computed using Fukunaga class separability, as used
by LookAhead for the all-valid dataset.
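
As a simplified illustration of the separability score, the sketch below handles the special case of a single scalar feature, where both scatter matrices reduce to numbers and the trace is just their ratio. The scalar restriction and the `normalised_weights` helper, which turns per-classifier scores into relative weights, are assumptions made for this example rather than the paper's procedure.

```python
def separability_1d(values_by_class):
    """Between-class over within-class scatter for one scalar feature.

    values_by_class: dict mapping a class label to its list of feature values.
    Assumes the within-class scatter is non-zero.
    """
    all_values = [v for vs in values_by_class.values() for v in vs]
    mu = sum(all_values) / len(all_values)           # global mean
    s_w = 0.0
    s_b = 0.0
    for vs in values_by_class.values():
        mu_i = sum(vs) / len(vs)                     # class mean
        s_w += sum((v - mu_i) ** 2 for v in vs)      # within-class scatter
        s_b += len(vs) * (mu_i - mu) ** 2            # between-class scatter
    if s_w == 0.0:
        raise ValueError("within-class scatter is zero")
    return s_b / s_w


def normalised_weights(scores):
    """Hypothetical helper: scale separability scores so that they sum to one."""
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}
```

Well separated classes give a large score, so a classifier trained on a feature type with a high score would receive a larger relative weight under this scheme.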


C. Classifier Combination Rules 

1) Sum Rule:

$$ P^*(c_j \mid f_H, f_J, f_E, f_T) = \sum_{k \in \{H,J,E,T\}} P(c_j \mid f_k) \tag{13} $$

2) Product Rule:

$$ P^*(c_j \mid f_H, f_J, f_E, f_T) = \prod_{k \in \{H,J,E,T\}} P(c_j \mid f_k) \tag{14} $$

3) OR Rule: If at least one of the constituent classifiers
predicts a URL to be bad, then the overall prediction is bad.

^13 As a common pre-processing step, we remove all frequently occurring
words or stop words (in the language) from a document.














TABLE VI. Performance of LookAhead, under various feature
combinations, and Prophiler on the all-valid dataset.

All-valid dataset size: 43,675 URLs, T_h = 60, C_r = 60
Reputation dimension: Trustworthiness

Feature sets | Avg. F1-Score (%) | FNR (%) | FPR (%)
H            | 75.7 **           | 23.6    | 24.9
J            | 74.9 **           | 25.1    | 25.1
E            | 70.5 **           | 29.0    | 29.9
T            | 74.9 **           | 24.5    | 25.6
H, J         | 76.8 **           | 22.6    | 23.8
H, E         | 77.2 **           | 24.0    | 21.6
H, T         | 78.7 **           | 20.1    | 22.3
J, E         | 74.1 **           | 26.4    | 25.5
J, T         | 77.8 **           | 21.7    | 22.7
E, T         | 79.1 **           | 21.1    | 20.6
H, J, E      | 79.2 **           | 21.7    | 19.8
H, J, T      | 79.6 **           | 19.5    | 21.2
H, E, T      | 81.7 **           | 18.0    | 18.7
J, E, T      | 80.9 **           | 19.1    | 19.1
H, J, E, T   | 82.4              | 17.3    | 17.9
Prophiler    | 75.3 **           | 15.0    | [illegible]

All-valid dataset size: 42,334 URLs, T_h = 60, C_r = 60
Reputation dimension: Child safety

Feature sets | Avg. F1-Score (%) | FNR (%) | FPR (%)
H            | 79.2 **           | 16.7    | 26.4
J            | 78.7 **           | 16.5    | 27.9
E            | 73.4 **           | 22.4    | 32.4
T            | 80.9 **           | 17.9    | 20.8
H, J         | 79.9 **           | 15.7    | 26.0
H, E         | 79.0 **           | 17.4    | 25.8
H, T         | 83.2 **           | 14.7    | 19.7
J, E         | 76.8 **           | 19.4    | 28.5
J, T         | 82.8 **           | 15.6    | 19.6
E, T         | 83.1 **           | 15.8    | 18.6
H, J, E      | 81.4 **           | 14.8    | 23.6
H, J, T      | 84.0 **           | 13.4    | 19.6
H, E, T      | 84.8 **           | 13.5    | 17.5
J, E, T      | 84.3 **           | 14.3    | 17.8
H, J, E, T   | 85.3              | 12.8    | 17.4
Prophiler    | 78.4 **           | 10.5    | 35.8

**: Statistically significant with 99% confidence.


TABLE VII. Performance of LookAhead on the opportunistic dataset.

Opportunistic dataset size: 89,220 URLs, T_h = 60 and C_r = 60
Reputation dimension: Trustworthiness

Comb. Rule | Balancing | Avg. F1-Score (%) | FNR (%) | FPR (%)
Adaptive   | -         | 80.3              | 33.9    | 10.5
Sum        | -         | 79.6 *            | 38.0    | 8.7
Product    | -         | 80.0 *            | 36.0    | 9.6
Or         | -         | 44.3 **           | 5.0     | 77.6
Voting     | -         | 71.4 **           | 23.6    | 32.4
Prophiler* | -         | [illegible] **    | 30.5    | 27.5
Adaptive   | ✓         | 77.1              | 20.0    | 25.2
Sum        | ✓         | 77.0              | 19.4    | 25.7
Product    | ✓         | 76.6 **           | 20.0    | 26.1
Or         | ✓         | 40.5 **           | 1.9     | 82.4
Voting     | ✓         | 64.2 **           | 10.5    | 50.3
Prophiler* | ✓         | 65.8 **           | 14.6    | 46.2

Opportunistic dataset size: 84,714 URLs, T_h = 60 and C_r = 60
Reputation dimension: Child safety

Comb. Rule | Balancing | Avg. F1-Score (%) | FNR (%) | FPR (%)
Adaptive   | -         | 82.6              | 26.8    | 10.0
Sum        | -         | 82.4 *            | 28.7    | 8.9
Product    | -         | 82.3 *            | 27.4    | 10.1
Or         | -         | 45.4 **           | 4.4     | 80.0
Voting     | -         | 73.0 **           | 17.7    | 34.0
Prophiler* | -         | 72.5 **           | 23.9    | 30.4
Adaptive   | ✓         | 80.8              | 25.9    | 14.1
Sum        | ✓         | 80.6 *            | 23.6    | 16.3
Product    | ✓         | 80.3 **           | 23.7    | 16.7
Or         | ✓         | 45.3 **           | 3.6     | 80.3
Voting     | ✓         | 69.3 **           | 13.0    | 43.3
Prophiler* | ✓         | 69.7 **           | 16.0    | 40.7

**: Statistically significant with 99% confidence.
* : Statistically significant with 95% confidence.


4) Majority Voting: The final classification is based on the
majority voting of the constituent classifiers. Ties are broken
randomly.

D. Evaluation Metrics

In our experimental evaluations we use F1-score (expressed
as a percentage) as the main performance indicator, which is
computed as:

$$ \text{F1-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \tag{15} $$

where

$$ \text{precision} = \frac{TP}{TP + FP}, \tag{16} $$

$$ \text{recall} = \frac{TP}{TP + FN}. \tag{17} $$

Here, TP = True positives, FP = False positives, and FN =
False negatives.

In line with standard practices [3], [32], to overcome
non-uniform class distribution in the test dataset, we use the
weighted average of the individual F1-scores of all classes:

$$ \text{Avg. F1-score} = \frac{\sum_i w_i \cdot \text{F1-score}_i}{\sum_i w_i}, \tag{18} $$

where F1-score_i is the F1-score of the i-th class and w_i is the
number of samples of class i in the test dataset. Additionally,
we report the FNR and FPR, computed from the confusion
matrix as:

$$ FNR = \frac{FN}{FN + TP}, \qquad FPR = \frac{FP}{FP + TN}, \tag{19} $$

where TN = True negatives.

E. Results

As in Section VI, for completeness, we present the
performance of LookAhead and Prophiler on the all-valid
dataset for T_h = 60, C_r = 60 in Table VI. Note that a
higher value of C_r allows more URLs to have valid ECDF
features and thus increases the dataset sizes in both
dimensions compared to the results presented in Table I. Table VII
presents the performance of LookAhead and Prophiler on the
opportunistic dataset with T_h = 60, C_r = 60, while using
various classifier combination rules.



